fix(connection-manager): add per-address timeout to prevent slow addr…#3412

Merged
dozyio merged 7 commits into libp2p:main from paschal533:fix/dial-queue-per-address-timeout
Apr 11, 2026

Contributor

@paschal533 paschal533 commented Mar 16, 2026

Problem

When dialing a peer with multiple multiaddrs (e.g. [/ip4/10.2.0.2/tcp/9106, /ip4/127.0.0.1/tcp/9106]), the dialTimeout (default 10s) was shared across all address attempts, which are made sequentially. If the first address hung (for example, a private IP in a different subnet whose TCP SYN is silently dropped), it consumed the entire timeout budget and subsequent reachable addresses were never tried.

Root cause

In dial-queue.ts, the batch-level signal (AbortSignal.timeout(dialTimeout)) was passed directly to every individual transportManager.dial() call. One slow address = all addresses fail.

Fix

Add addressDialTimeout (default 6_000 ms) to ConnectionManagerConfig. For each address attempt, a per-address signal is composed from the outer batch signal and a fresh AbortSignal.timeout(addressDialTimeout):

const addressSignal = anySignal([signal, AbortSignal.timeout(this.addressDialTimeout)])
  • If the per-address timeout fires first → error is caught, dial continues to the next address
  • If the batch timeout fires first → signal.aborted is true → whole operation aborts as before
  • .clear() is called in finally to prevent listener leaks
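
The composition described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in dial-queue.ts: `combine` is a small local substitute for `anySignal`, while `DialFn` and `dialSequentially` are hypothetical names standing in for `transportManager.dial()` and the dial queue loop.

```typescript
// Hypothetical stand-in for transportManager.dial(); not the real libp2p signature.
type DialFn = (addr: string, signal: AbortSignal) => Promise<string>

interface Combined { signal: AbortSignal, clear: () => void }

// Local substitute for anySignal(): the returned signal aborts when any input
// aborts, and clear() removes the listeners added to the input signals.
function combine (signals: AbortSignal[]): Combined {
  const controller = new AbortController()
  const removers: Array<() => void> = []

  for (const s of signals) {
    if (s.aborted) {
      controller.abort(s.reason)
      break
    }
    const onAbort = (): void => { controller.abort(s.reason) }
    s.addEventListener('abort', onAbort, { once: true })
    removers.push(() => { s.removeEventListener('abort', onAbort) })
  }

  return {
    signal: controller.signal,
    clear: () => { removers.forEach(remove => { remove() }) }
  }
}

async function dialSequentially (
  addrs: string[],
  dial: DialFn,
  batchSignal: AbortSignal,
  addressDialTimeout: number
): Promise<string> {
  const errors: unknown[] = []

  for (const addr of addrs) {
    // Fresh per-address budget, still bounded by the outer batch signal.
    const perAddress = combine([batchSignal, AbortSignal.timeout(addressDialTimeout)])

    try {
      return await dial(addr, perAddress.signal)
    } catch (err) {
      // Batch budget exhausted: abort the whole operation.
      if (batchSignal.aborted) {
        throw batchSignal.reason
      }
      // Only this address failed or timed out: record it and try the next one.
      errors.push(err)
    } finally {
      // Avoid leaking listeners on the long-lived batch signal.
      perAddress.clear()
    }
  }

  throw new AggregateError(errors, 'All address dials failed')
}
```

In this sketch a hung first address costs at most `addressDialTimeout` before the loop moves on, which is the behaviour the first bullet describes.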

The addressDialTimeout is configurable via connectionManager.addressDialTimeout for users on slower transports (WebRTC, satellite links) who need more time per address.
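
As a rough illustration of how the two timeouts interact, the sketch below computes how much of the batch budget is left before a given attempt, in the worst case where every earlier address hangs for its full per-address timeout. The field names and defaults come from the description above; the `DialTimeouts` interface, the `slowLink` values and `budgetLeftBeforeAttempt` are hypothetical and exist only for this example.

```typescript
// Local, hypothetical stand-in for the relevant slice of the connectionManager options.
interface DialTimeouts {
  dialTimeout: number // overall budget for the whole batch of addresses (ms)
  addressDialTimeout: number // budget for a single address attempt (ms)
}

// Defaults as stated in this PR.
const DEFAULTS: DialTimeouts = { dialTimeout: 10_000, addressDialTimeout: 6_000 }

// Illustrative override for a slower transport; the numbers are assumptions.
const slowLink: DialTimeouts = { dialTimeout: 30_000, addressDialTimeout: 12_000 }

// Worst case: each earlier address hangs for the full per-address timeout.
// Returns the batch budget (ms) remaining when attempt `index` (0-based) starts.
function budgetLeftBeforeAttempt (timeouts: DialTimeouts, index: number): number {
  return Math.max(0, timeouts.dialTimeout - index * timeouts.addressDialTimeout)
}

console.log(budgetLeftBeforeAttempt(DEFAULTS, 1)) // 4000
console.log(budgetLeftBeforeAttempt(slowLink, 2)) // 6000
```

With the defaults, one hung address leaves about 4 s of the batch budget for the second attempt, so anyone raising addressDialTimeout may also want to raise dialTimeout.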

Tests

Added a describe('addressDialTimeout') suite with 5 tests:

  • Per-address timeout fires → next address is tried (verified by call tracking, not just timing)
  • Exact bug-report scenario: 3 hung addresses then 1 reachable → all 4 tried in order, completes in ~3×addressDialTimeout not dialTimeout
  • All addresses hit per-address timeout → AggregateError (not TimeoutError)
  • Batch timeout fires first → name === 'TimeoutError'
  • Fast addresses are unaffected (regression guard)
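
The two failure modes in the list above can be told apart by a caller along these lines. This is a hedged sketch that assumes the errors reach the caller unchanged, as the tests assert; `classifyDialFailure` and the `DialFailure` labels are hypothetical names, not part of libp2p.

```typescript
type DialFailure = 'batch-timeout' | 'all-addresses-failed' | 'other'

// Maps a dial error to one of the outcomes covered by the test suite.
function classifyDialFailure (err: unknown): DialFailure {
  // Every address failed or hit its per-address timeout.
  if (err instanceof AggregateError) {
    return 'all-addresses-failed'
  }

  // The overall dialTimeout fired first; AbortSignal.timeout() aborts with
  // a reason whose name is 'TimeoutError'.
  if (typeof err === 'object' && err !== null && (err as { name?: unknown }).name === 'TimeoutError') {
    return 'batch-timeout'
  }

  return 'other'
}
```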

Closes #2368

@paschal533 paschal533 marked this pull request as ready for review March 16, 2026 15:51
@paschal533 paschal533 requested a review from a team as a code owner March 16, 2026 15:51
Comment thread packages/libp2p/src/connection-manager/constants.defaults.ts Outdated
Comment thread packages/libp2p/src/connection-manager/dial-queue.ts Outdated
Comment thread packages/libp2p/src/connection-manager/index.ts Outdated
@tabcat tabcat requested a review from dozyio March 19, 2026 11:01
Collaborator

dozyio commented Mar 28, 2026

Thanks for this @paschal533

Kind of thinking parallel dial with limits (2 or 3 concurrent)... Maybe dial best address first via AddressSorter then after 250ms race the rest?

@achingbrain
Member

> then after 250ms race the rest?

If you mean dial them in parallel I would be careful here as this will consume a lot of resources, particularly if the node is doing, for example, DHT things where you expect most dials to fail.

This opens the node up to an amplification-style attack where a peer might have a peer record with lots of bogus or slow addresses, so a single dial request causes the target node to make many dials, all of which are doomed.

The current behaviour of dialling addresses sequentially was introduced partly because of the above, but also due to empirical observations that if the first address dialled is a "good" candidate (e.g. public, non-NATed, or supports hole punching) and it fails, all other addresses will likely fail too.

That aside, having a shorter per-address timeout signal in addition to an overall dial timeout seems like a reasonable approach.

Collaborator

dozyio commented Mar 31, 2026

Thanks for the context @achingbrain

Comment thread packages/libp2p/src/connection-manager/constants.defaults.ts Outdated
Comment thread packages/libp2p/test/connection-manager/dial-queue.spec.ts Outdated
/**
 * @see https://libp2p.github.io/js-libp2p/interfaces/index._internal_.ConnectionManagerConfig.html#addressDialTimeout
 */
export const ADDRESS_DIAL_TIMEOUT = 6_000
Member

@tabcat tabcat Apr 1, 2026

This default looks reasonable.

In the future, I wonder whether it would be beneficial to add a smaller timeout at the raw connection layer. The intention would be to time out quickly (~4 seconds) when the destination host is offline.

@paschal533 paschal533 force-pushed the fix/dial-queue-per-address-timeout branch from ed86ac1 to cc1c321 Compare April 1, 2026 11:36
- Fix JSDoc @see URL to point to public ConnectionManagerInit interface
  instead of internal ConnectionManagerConfig

- Replace timing-based assertion in the sequential-hang test with a
  signal-based one: listen for abort on each hung dial's options.signal
  and count aborts. Verifies the per-address AbortSignal was actually
  fired three times rather than measuring wall-clock elapsed time, which
  is flaky on loaded CI runners.
@paschal533
Contributor Author

hey @tabcat, made both changes. Fixed the JSDoc URL to point to ConnectionManagerInit instead of the internal ConnectionManagerConfig. For the timing test, replaced the elapsed-time assertions with a signal-based approach: added an abort listener inside callsFake that increments dialsAborted, then assert expect(dialsAborted).to.equal(3). This directly verifies that the per-address signal actually fired for the 3 hung dials instead of relying on wall clock. Also tested it in a real environment with 2 blackhole TCP servers (they accept but never send data, so the noise handshake hangs) + 1 real listener: connected in ~1200ms, which is roughly 2 × addressDialTimeout, not the 10s batch timeout. Works as expected.
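
The signal-based assertion described in this comment might look roughly like the sketch below. It does not reproduce the actual spec file: there is no sinon stub here, and `hungDial` and `runScenario` are hypothetical helpers that only mimic a dial which hangs until its signal aborts.

```typescript
// Counts how many hung dials saw their per-address signal abort.
let dialsAborted = 0

// Hypothetical fake dial: never connects, rejects only when the signal aborts.
function hungDial (signal: AbortSignal): Promise<never> {
  return new Promise<never>((_resolve, reject) => {
    // A referenced timer keeps the process alive while the dial hangs.
    const keepAlive = setTimeout(() => {}, 60_000)

    signal.addEventListener('abort', () => {
      clearTimeout(keepAlive)
      dialsAborted++
      reject(signal.reason)
    }, { once: true })
  })
}

// Dials `hungCount` hung addresses in sequence, then pretends the last one connects.
async function runScenario (hungCount: number, addressDialTimeout: number): Promise<string> {
  for (let i = 0; i < hungCount; i++) {
    try {
      await hungDial(AbortSignal.timeout(addressDialTimeout))
    } catch {
      // Per-address timeout fired: move on to the next address.
    }
  }

  return 'connected'
}
```

Asserting on `dialsAborted` after `runScenario(3, ...)` checks that each hung dial was actually aborted, without measuring elapsed time.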

@dozyio dozyio merged commit 2151144 into libp2p:main Apr 11, 2026
48 of 49 checks passed
@tabcat tabcat mentioned this pull request Apr 12, 2026

Development

Successfully merging this pull request may close these issues.

Dial queue timeout for multiple multiaddresses incorrectly used for each connection separately

4 participants